[KV Connector] Opt DecodeBenchConnector into SupportsHMA #41770
Merged
ywang96 merged 1 commit into vllm-project:main · May 7, 2026
Conversation
Contributor
Code Review
This pull request updates the DecodeBenchConnector to support the hybrid KV cache manager (HMA) by inheriting from the SupportsHMA interface and implementing the request_finished_all_groups method for proper cleanup. I have no feedback to provide as there were no review comments to evaluate.
ywang96 approved these changes on May 6, 2026
Previously the HMA (hybrid KV cache manager) layer refused to activate when DecodeBenchConnector was in use, because the connector did not advertise SupportsHMA. That forced decode-only benchmark recipes to pass --disable-hybrid-kv-cache-manager, which collapsed hybrid-model KV cache groups (SWA / MLA compress=4 / MLA compress=128 / sparse indexer) into a single uniform page size via unify_kv_cache_spec page_size, throwing away the compression savings and capping concurrent capacity on hybrid models (e.g. DeepSeek-V4 saw ~43 concurrent 8k/1k requests instead of the model's true ceiling).

This connector is a dummy fill that owns no external per-block state, so the HMA path has nothing extra to do. Implementation is minimal:

- Inherit from SupportsHMA.
- Implement request_finished_all_groups: delegates to the same scheduler.request_finished() cleanup as the single-group variant, ignoring block_ids (no per-block state to release).

With this change, recipes can drop --disable-hybrid-kv-cache-manager and let HMA size each KV cache group correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
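The delegation pattern described above can be sketched in a few lines. This is a self-contained illustration, not the actual vLLM source: the real base class (KVConnectorBase_V1), the scheduler-side component, and the method signatures are stubbed with hypothetical shapes; only the shape of the change mirrors the description.

```python
class SupportsHMA:
    """Marker mixin: the connector is safe under the hybrid KV cache manager."""


class _SchedulerSideStub:
    """Hypothetical stand-in for the connector's scheduler-side component."""

    def request_finished(self, request, block_ids):
        # Dummy-fill connector: no external per-block state to release
        # and no async KV transfer to wait for.
        return False, None


class DecodeBenchConnector(SupportsHMA):  # real base class omitted here
    def __init__(self):
        self.connector_scheduler = _SchedulerSideStub()

    def request_finished(self, request, block_ids):
        """Single-group cleanup path."""
        return self.connector_scheduler.request_finished(request, block_ids)

    def request_finished_all_groups(self, request, block_ids):
        """HMA cleanup path: block_ids arrive grouped per KV cache group,
        but this connector ignores them and reuses the same cleanup."""
        return self.connector_scheduler.request_finished(request, block_ids)
```

Because the connector owns no per-block state, both paths collapse to the same scheduler-side call, which is why the HMA variant can safely ignore the per-group block ids.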
auto-merge was automatically disabled May 7, 2026 22:51: head branch was pushed to by a user without write access (22e669d to 9404a3d)
whytem pushed a commit to whytem/vllm that referenced this pull request on May 8, 2026:
…t#41770)
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libinta pushed a commit to libinta/vllm that referenced this pull request on May 8, 2026:
…t#41770)
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Summary
Opt DecodeBenchConnector into SupportsHMA so decode-only benchmark recipes can run with the hybrid KV cache manager enabled. Without this, users must pass --disable-hybrid-kv-cache-manager, which collapses hybrid-model KV cache groups (SWA / MLA compress=4 / MLA compress=128 / sparse indexer) into a single uniform page size and throws away the compression savings.

Implementation is minimal because this connector is a dummy fill that owns no external per-block state: SupportsHMA.request_finished_all_groups performs the same cleanup as the single-group request_finished (delegates to connector_scheduler.request_finished()); block_ids is ignored because there is no per-block external state to release.

Not a duplicate
Searched open PRs for DecodeBenchConnector, SupportsHMA, and decode_bench_connector (via the GitHub search API). No PR targets this connector. The closest neighbor, #41644 ("Keep HMA enabled for supported KV connectors"), is orthogonal — it changes the default HMA decision in vllm/config/vllm.py for connectors that already declare SupportsHMA. This PR adds one more connector to that supported set; the two changes compose.

Test plan
Hardware: 1× node with 4× GB200, single-node DEP=4 (DP=4, EP=4). Model: DeepSeek-V4-Flash with --load-format dummy and --tokenizer-mode deepseek_v4. Connector: DecodeBenchConnector, HMA explicitly enabled via --no-disable-hybrid-kv-cache-manager.

Same vllm serve command in both runs.

Without the PR (only difference: class DecodeBenchConnector(KVConnectorBase_V1), no SupportsHMA mixin): crashes at engine init in KVConnectorFactory.create_connector_v1(). All four DP ranks fail identically; the engine cores then exit with RuntimeError: Worker failed with error '...'.

With the PR: server starts cleanly, and HMA is active.

Functional sanity check via vllm bench serve (random ISL=4000, OSL=100, 32 prompts, max-concurrency=8):
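As a closing illustration, the capacity claim in the summary (collapsing KV cache groups to one uniform page size costs concurrency on hybrid models) can be made concrete with loose arithmetic. Every number below is hypothetical — this is not vLLM's actual memory accounting, just the shape of the argument:

```python
GIB = 1024 ** 3

# Hypothetical hybrid model: one full-attention KV group plus one MLA
# group whose cache is 4x smaller per token (compress=4).
kv_budget_bytes = 8 * GIB             # made-up per-GPU KV cache budget
tokens_per_request = 8000 + 1000      # 8k prefill + 1k decode
full_attn_per_token = 8192            # made-up bytes/token, full attention
mla_per_token = full_attn_per_token // 4

# HMA sizes each group by its own per-token cost.
per_req_hma = tokens_per_request * (full_attn_per_token + mla_per_token)
concurrency_hma = kv_budget_bytes // per_req_hma

# With the hybrid manager disabled, both groups are provisioned at the
# uncompressed rate, so the MLA group's 4x savings are lost.
per_req_uniform = tokens_per_request * (2 * full_attn_per_token)
concurrency_uniform = kv_budget_bytes // per_req_uniform

print(concurrency_uniform, concurrency_hma)  # uniform fits fewer requests
```

Under these made-up numbers the uniform page size supports noticeably fewer concurrent 8k/1k requests than per-group sizing, which is the effect the summary describes for DeepSeek-V4.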